========================================================
## [1] 1599 13
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
Our dataset consists of 13 variables and 1599 observations. None of the columns contains null values. We converted quality from an integer to a factor and removed the X column, which leaves 12 variables.
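The cleaning steps described above can be sketched as follows. This is a minimal sketch, not the original code: it builds a ten-row stand-in for `wineData` from values shown in the `str()` output instead of reading the real file, and it assumes `X` is a plain row-index column.

```r
# Toy stand-in for the raw data: ten rows of two real columns, with values
# taken from the str() output above, plus an assumed row-index column X.
wineData <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Count missing values per column (all zeros means there are no nulls).
na_counts <- colSums(is.na(wineData))

# Drop the index column and turn quality into a factor.
wineData$X <- NULL
wineData$quality <- factor(wineData$quality)

print(na_counts)
str(wineData)
```

On the full data set the same steps would leave 12 columns, with `quality` as a factor with 6 levels.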
From the above, we can see that the quality scores in our red wine data span from 3 to 8. Most wines have a quality of 5 or 6, and very few have a quality of 3 or 8. We also used a more detailed scale in the y direction to see how imbalanced the quality data is. From there we can observe that roughly 50 wines have a quality score of 4 (53, according to the summary above).
From the above we can see that most of the red wines have a pH of 3.2 to 3.4, and the pH distribution looks approximately normal. There also seem to be outliers in the pH level: values above 4 and below 2.75.
From the alcohol data we can see that the most common alcohol content is a little above 9%, although the median is 10.2%. We can also observe that most of the wines have a density of 0.995 to 1 and that the density distribution looks approximately normal.
Most of the data has citric acid near 0, although there is also a considerable amount around 0.5. The graph above looks right skewed, so we may need to use a log transformation later.
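One practical detail when log-transforming citric acid is that many wines have a value of exactly 0, and the logarithm of 0 is not finite. The sketch below uses the first ten citric acid values from the `str()` output; using `log1p()` (that is, log(1 + x)) is an assumption made here for illustration, not something the original analysis does.

```r
# First ten citric acid values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# A plain log gives -Inf wherever citric acid is 0.
log_plain <- log(citric)

# log1p(x) = log(1 + x) stays finite and maps 0 to 0.
log_shifted <- log1p(citric)

print(log_plain)
print(log_shifted)
```

An alternative would be to leave the variable untransformed and only rescale the plot axis.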
The most common fixed acidity values lie between 6 and 8, while the most common volatile acidity values lie between 0.4 and 0.8.
For residual sugar, the most common values lie between 1 and 3 g/dm^3. For sulphates, the most common values lie between 0.5 and 1 g/dm^3.
We can also see several wines with a residual sugar measurement above 8 and a total sulfur dioxide above 200. These could be outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
For free sulfur dioxide, the most common values lie between 0 and 15 mg/dm^3. For total sulfur dioxide, the most common values lie between 0 and 50 mg/dm^3. We also create a new variable, the free sulfur dioxide percentage, which is the free sulfur dioxide divided by the total sulfur dioxide. All of its values are between 0 and 1.
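The derived variable can be computed with a single vectorised division. The sketch below uses the first ten rows from the `str()` output; the column name `free.sulfur.dioxide.pct` is a hypothetical choice, since the original name is not shown.

```r
# First ten rows of the two sulfur dioxide columns, from the str() output.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Share of free SO2 in total SO2 (hypothetical column name).
so2$free.sulfur.dioxide.pct <-
  so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

# The share should lie between 0 and 1 because free SO2 is part of total SO2.
print(range(so2$free.sulfur.dioxide.pct))
```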
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most wines have chlorides of around 0.08 g/dm^3 (the median is 0.079).
Our dataset consists of 13 variables and 1599 observations. All of the variables are numeric.

- Most of the red wines have a quality of 5 or 6.
- Most of the variables, such as chlorides, sulphates, and volatile acidity, seem to have a roughly normal distribution.
- The citric acid variable seems to be skewed to the right. We may need to use a log transformation for it later.

The main features in the data set are alcohol and quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model to determine the quality of red wine.
Some other variables that indicate the taste of the wine, such as acidity, sugar, and density, are likely to affect the quality of the red wine. After researching wine quality, I think density and alcohol contribute most to the quality.
I created the free sulfur dioxide percentage, which measures the share of free sulfur dioxide in the total sulfur dioxide of each wine.
I checked each column for null values and found none. I also changed quality to be a factor instead of a number.
From the plot above, we can see that alcohol, sulphates, and volatile acidity seem to have a weak correlation with quality. The plot also confirms our previous finding that some of the variables have a right-skewed distribution.
However, it is interesting to see strong correlations between some pairs of the other variables, such as fixed acidity vs citric acid, volatile acidity vs citric acid, fixed acidity vs density, and fixed acidity vs pH. Even though we are mainly interested in which factors correlate with quality, it is also important to note which pairs of variables are correlated with each other, especially if we later want to fit a linear regression.
I want to look closer at the scatter plots of quality against some other variables, namely alcohol, volatile acidity, and sulphates.
Since quality is discrete, a plain scatter plot would suffer from overplotting, with many points drawn on top of each other. We therefore add jitter to spread the points out.
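To illustrate what jitter does, the sketch below uses base R's `jitter()` on ten quality scores. The original plots may well have been drawn with a plotting package instead; the `amount` of 0.2 and the seed are arbitrary choices for this example.

```r
# Quality scores and alcohol for the first ten wines (from the str() output).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

set.seed(1)  # make the random offsets reproducible

# Add uniform noise in [-0.2, 0.2] so identical scores no longer overlap.
quality_jittered <- jitter(quality, amount = 0.2)

# The jittered values can then be plotted, for example with:
# plot(quality_jittered, alcohol, col = rgb(0, 0, 0, 0.3), pch = 16)
print(round(quality_jittered, 2))
```

The noise only affects where points are drawn. Any model should still be fitted on the original scores.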
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
## wineData$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wineData$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wineData$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wineData$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wineData$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wineData$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
After adding jitter to the plot, we can see that, in most cases, the alcohol content of the red wine seems to correlate with its quality. The R-squared value shows that alcohol explains about 22.7% of the variance in quality (adjusted R-squared: 0.2263).
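One detail of the model call above is worth spelling out. Because quality is a factor, `as.numeric(quality)` returns the internal level codes (1 to 6), not the scores (3 to 8). The sketch below, on ten rows from the `str()` output with the factor levels assumed to be 3 through 8, shows that the slope is unaffected while the intercept shifts by 2.

```r
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8)

# as.numeric() on a factor gives level codes: score 5 becomes code 3.
codes  <- as.numeric(quality)
# Going through as.character() recovers the actual scores.
scores <- as.numeric(as.character(quality))

fit_codes  <- lm(codes ~ alcohol)
fit_scores <- lm(scores ~ alcohol)

print(coef(fit_codes))
print(coef(fit_scores))
```

Under that assumption, the intercept of -0.125 reported above would correspond to about 1.875 on the original 3 to 8 scale, and the slope and R-squared would be unchanged.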
Let’s try the same thing for the volatile acidity.
## wineData$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wineData$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wineData$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wineData$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wineData$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wineData$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
After setting the transparency, there seems to be a weak negative linear correlation between volatile acidity and quality. This becomes clearer when we plot it as a boxplot. About half of the lowest-quality wines have a higher volatile acidity than most wines of greater quality. In fact, volatile acidity explains about 15% of the variance in quality. As quality increases, the median volatile acidity falls (it levels off between quality 7 and 8), and the mean follows the same pattern apart from a slight rise from quality 7 to 8. Since most of the value ranges overlap across quality levels in the boxplot, volatile acidity may only be useful for telling wines of quality 3 apart from wines of quality 7 to 8.
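The trend in the per-quality summaries can be checked directly from the printed numbers. The sketch below copies the medians and means of volatile acidity from the output above and looks at the step from each quality level to the next.

```r
# Median and mean volatile acidity per quality level, from the output above.
va_median <- c("3" = 0.845, "4" = 0.670, "5" = 0.580,
               "6" = 0.490, "7" = 0.370, "8" = 0.370)
va_mean   <- c("3" = 0.8845, "4" = 0.694, "5" = 0.577,
               "6" = 0.4975, "7" = 0.4039, "8" = 0.4233)

# Change from one quality level to the next.
median_steps <- diff(va_median)
mean_steps   <- diff(va_mean)

# Do the medians ever go up?
median_non_increasing <- all(median_steps <= 0)

# Quality levels at which the mean is higher than at the level before.
mean_rises_at <- names(va_mean)[-1][unname(mean_steps > 0)]

print(median_non_increasing)
print(mean_rises_at)
```

The medians never increase, while the mean ticks up once, at quality 8, where only 18 wines are available.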
##
## Call:
## lm(formula = as.numeric(quality) ~ sulphates, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2432 -0.5424 0.1102 0.4456 2.3977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.84775 0.07842 36.31 <2e-16 ***
## sulphates 1.19771 0.11539 10.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7819 on 1597 degrees of freedom
## Multiple R-squared: 0.0632, Adjusted R-squared: 0.06261
## F-statistic: 107.7 on 1 and 1597 DF, p-value: < 2.2e-16
We can also see a correlation between sulphates and quality: the median sulphates content increases as quality increases. However, there seem to be a considerable number of outliers when the quality is 5 or 6. This may be because we have a lot more data for wines of quality 5 and 6, or it could indicate that sulphates are not a strong indicator of quality. In fact, sulphates explain only about 6% of the variance in quality.
From the boxplot and scatter plot drawn above, free sulfur dioxide does not seem to have a linear correlation with quality.
Let’s look at the relationship between fixed acidity and density.
##
## Call:
## lm(formula = density ~ fixed.acidity, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0064452 -0.0007700 0.0000738 0.0009434 0.0055816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.907e-01 1.716e-04 5774.70 <2e-16 ***
## fixed.acidity 7.242e-04 2.018e-05 35.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001405 on 1597 degrees of freedom
## Multiple R-squared: 0.4463, Adjusted R-squared: 0.4459
## F-statistic: 1287 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the graph, there seems to be a moderate positive linear relation between fixed acidity and density. In fact, fixed acidity explains about 45% of the variance in density.
We know that at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, while at low concentrations it is undetectable. Let’s find out how many samples have a high concentration of SO2, and how they are distributed across the quality values.
## Mode FALSE TRUE
## logical 1583 16
## wineData$quality: 3
## Mode FALSE
## logical 10
## --------------------------------------------------------
## wineData$quality: 4
## Mode FALSE
## logical 53
## --------------------------------------------------------
## wineData$quality: 5
## Mode FALSE TRUE
## logical 672 9
## --------------------------------------------------------
## wineData$quality: 6
## Mode FALSE TRUE
## logical 633 5
## --------------------------------------------------------
## wineData$quality: 7
## Mode FALSE TRUE
## logical 197 2
## --------------------------------------------------------
## wineData$quality: 8
## Mode FALSE
## logical 18
It seems that in our sample data, only red wines of quality 5 to 7 include samples with a high concentration of SO2. However, the number of such cases (16 out of 1599) is small compared to the sample size, so it would not be wise to conclude that, in general, red wine with a high concentration of SO2 is of quality 5 to 7.
Let’s look at the correlation of fixed acidity vs pH.
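The flag-and-count step behind the tables above can be sketched as follows. The eight rows here are made-up values chosen only to include a few wines above the 50 ppm threshold, and the column name `high_so2` is hypothetical.

```r
# Made-up illustration data (not rows from the real data set).
toy <- data.frame(
  free.sulfur.dioxide = c(11, 25, 55, 17, 68, 13, 51, 15),
  quality             = factor(c(5, 5, 5, 6, 6, 7, 7, 8))
)

# Flag wines whose free SO2 exceeds the 50 ppm threshold.
toy$high_so2 <- toy$free.sulfur.dioxide > 50

# Overall count, and count per quality level.
n_high          <- sum(toy$high_so2)
high_by_quality <- tapply(toy$high_so2, toy$quality, sum)

# In the real data, 16 of 1599 wines are flagged: about 1%.
share_high <- 16 / 1599

print(high_by_quality)
print(round(share_high, 3))
```

With only about 1% of wines above the threshold, any pattern across quality levels rests on very few observations.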
##
## Call:
## lm(formula = pH ~ fixed.acidity, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51780 -0.06547 0.00164 0.06488 0.52207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.814959 0.013776 276.93 <2e-16 ***
## fixed.acidity -0.060561 0.001621 -37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1128 on 1597 degrees of freedom
## Multiple R-squared: 0.4665, Adjusted R-squared: 0.4661
## F-statistic: 1396 on 1 and 1597 DF, p-value: < 2.2e-16
There seems to be a moderate negative correlation between pH and fixed acidity. This makes sense: the more acid a wine contains, the lower its pH.
Let’s look at the correlation between volatile acidity and alcohol to determine whether we can use both variables to predict the quality of the wine.
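For a regression with a single predictor, R-squared equals the square of the Pearson correlation, so the R-squared of 0.4665 reported above translates into a correlation of about -0.68. The sketch below checks the identity on the first ten rows from the `str()` output and then applies it to the full-data R-squared.

```r
# First ten rows of the two columns, from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

fit       <- lm(pH ~ fixed_acidity)
r         <- cor(fixed_acidity, pH)
r_squared <- summary(fit)$r.squared

# The identity r^2 == R-squared holds for any simple linear regression.
identity_holds <- isTRUE(all.equal(r^2, r_squared))

# Applied to the full-data R-squared; the sign follows the negative slope.
r_full <- -sqrt(0.4665)

print(identity_holds)
print(round(r_full, 3))
```

The same calculation for density on fixed acidity (R-squared 0.4463, positive slope) gives a correlation of about 0.67.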
##
## Call:
## lm(formula = volatile.acidity ~ alcohol, data = wineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3692 -0.1292 -0.0084 0.1007 1.0684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.882094 0.043142 20.446 < 2e-16 ***
## alcohol -0.033990 0.004118 -8.255 3.16e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1754 on 1597 degrees of freedom
## Multiple R-squared: 0.04092, Adjusted R-squared: 0.04032
## F-statistic: 68.14 on 1 and 1597 DF, p-value: 3.155e-16
As we can see from above, the adjusted R-squared between volatile acidity and alcohol is small (about 0.04). The two variables carry largely independent information, which indicates that we can use both together to predict the quality of the wine.
Quality correlates weakly with volatile acidity and alcohol.
When the alcohol percentage increases, the quality tends to increase. However, as quality increases, the variance of the alcohol percentage also increases.
Based on the R^2 value, alcohol explains about 23% of the variance in quality. Other features of interest could be incorporated into the model to explain more of the variance in quality.
Red wines with a high volatile acidity tend to have lower quality. Most wines of quality 3 have a volatile acidity above 0.6 g/dm^3 (the median is 0.845 g/dm^3), and most wines of quality greater than 6 have a volatile acidity below 0.5 g/dm^3. I suppose this is because high volatile acidity leads to an unpleasant taste.
Moreover, wines with a higher concentration of sulphates tend to have higher quality.
We can see stronger correlations between fixed acidity and density and between fixed acidity and pH.
The strongest relationship is between fixed acidity and pH. As for the main feature, quality correlates most strongly with alcohol and volatile acidity. Moreover, since alcohol and volatile acidity do not seem to be correlated with each other, we can use both variables later to predict the quality of the wine.
In the plot above we show the median volatile acidity against alcohol for each quality value. We can see that as quality decreases toward its lowest level, the volatile acidity tends to increase for some values of alcohol. However, this does not hold for all alcohol percentages, especially above 12%. We can also see that each line looks erratic, which suggests that there is no linear relation between alcohol and volatile acidity.
In the plot above, even though there is a correlation between fixed acidity and pH, there is no clear separation of the point colors. This suggests that the two variables combined do not correlate with quality.
This plot looks a bit better than the previous one, as we can see some grouping in the location of each color.
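One common way to quantify how much two predictors overlap is the variance inflation factor (VIF). It is not part of the original analysis and is shown here only as an illustration. With two predictors, the VIF of each is 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other, which is the regression printed above.

```r
# Multiple R-squared of volatile.acidity ~ alcohol, from the output above.
r_squared <- 0.04092

# Variance inflation factor for a model that uses both predictors.
vif <- 1 / (1 - r_squared)

print(round(vif, 3))
```

A value this close to 1 means that using alcohol and volatile acidity together barely inflates the variance of either coefficient estimate.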
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = wineData)
## m2: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity,
## data = wineData)
## m3: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## sulphates, data = wineData)
## m4: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## sulphates + citric.acid, data = wineData)
## m5: lm(formula = as.numeric(quality) ~ alcohol + volatile.acidity +
## sulphates + citric.acid + density, data = wineData)
##
## ==========================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------
## (Intercept) -0.125 1.095*** 0.611** 0.646** -14.504
## (0.175) (0.184) (0.196) (0.201) (11.964)
## alcohol 0.361*** 0.314*** 0.309*** 0.309*** 0.323***
## (0.017) (0.016) (0.016) (0.016) (0.019)
## volatile.acidity -1.384*** -1.221*** -1.265*** -1.301***
## (0.095) (0.097) (0.113) (0.116)
## sulphates 0.679*** 0.696*** 0.680***
## (0.101) (0.103) (0.104)
## citric.acid -0.079 -0.155
## (0.104) (0.120)
## density 15.106
## (11.927)
## ------------------------------------------------------------------------------------------
## R-squared 0.227 0.317 0.336 0.336 0.337
## adj. R-squared 0.226 0.316 0.335 0.334 0.335
## sigma 0.710 0.668 0.659 0.659 0.659
## F 468.267 370.379 268.912 201.777 161.803
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1599.384 -1599.093 -1598.288
## Deviance 805.870 711.796 692.105 691.852 691.157
## AIC 3448.114 3251.628 3208.768 3210.186 3210.576
## BIC 3464.245 3273.136 3235.654 3242.448 3248.216
## N 1599 1599 1599 1599 1599
## ==========================================================================================
Alcohol and volatile acidity seem to strengthen each other. Moreover, density also seems to strengthen alcohol.
The interaction between alcohol and density seems to contribute to the quality of the wine. However, it is not strong enough, as points of different quality values still overlap each other in the scatter plot we drew.
Yes, I created 5 models to predict the quality of the wine. At first, I used only alcohol, which gave an adjusted R-squared of 0.226. I then added volatile acidity, which raised the adjusted R-squared to 0.316 while all variables remained significant. Next I added sulphates, which raised the adjusted R-squared further to 0.335, again with all variables significant. Finally, I added citric acid and density, after which the adjusted R-squared barely changed and the added variables were not significant.
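The information criteria in the table point the same way as the adjusted R-squared. The sketch below copies the AIC and BIC rows from the model table above and picks the model with the lowest value of each (lower is better for both).

```r
# AIC and BIC per model, copied from the comparison table above.
aic <- c(m1 = 3448.114, m2 = 3251.628, m3 = 3208.768,
         m4 = 3210.186, m5 = 3210.576)
bic <- c(m1 = 3464.245, m2 = 3273.136, m3 = 3235.654,
         m4 = 3242.448, m5 = 3248.216)

# Name of the model with the smallest criterion value.
best_aic <- names(which.min(aic))
best_bic <- names(which.min(bic))

print(c(AIC = best_aic, BIC = best_bic))
```

Both criteria favour m3 (alcohol, volatile acidity, and sulphates), which is consistent with citric acid and density adding little.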
From the plot above, we can see that our wine quality data is imbalanced. There are a lot more ordinary wines (quality 5 to 6) than good ones (quality 7 to 8) or bad ones (quality 3 to 4). This is probably because ordinary wine is the most popular, as it is less expensive than good wine but still tasty.
In this plot, the alcohol percentage tends to increase as the quality of the wine increases. However, there is an anomaly at quality 5, where the average alcohol percentage is lower than at quality 4. There are also some outliers at quality 5.
The plot above shows that it may be possible to predict quality using alcohol and volatile acidity. When comparing the lines for low and high quality, for example quality 3 vs 8, we can see a clear separation, so it should be easy to separate low-quality wines from high-quality ones. However, the relationship may not be linear. To categorise the wines, it may be better to use a classifier such as logistic regression or an SVM, or a clustering method such as K-means.
The red wine data set contains information on 1599 red wines across 12 variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of red wine across many variables and created a linear model to predict the quality.
There was a clear trend between the alcohol or volatile acidity of a wine and its quality. I was surprised that pH and free sulfur dioxide did not have a strong negative correlation with quality, but these variables are likely to be represented by sulphates. I struggled to understand the outliers in the box plots that usually occurred when the quality is 5 or 6, but this became clearer when I realized that most of the data has a quality of 5 or 6. For the linear model, all red wines were included, since information on quality, volatile acidity, alcohol, sulphates, citric acid, and density was available for all rows. After fitting the linear model without transforming the variables, the model was able to account for 33.7% of the variance in quality.
One limitation of this model is that we use linear regression to fit a factor variable. In this case, I think it may be better to split quality into two classes, good and bad, and fit a logistic regression, SVM, or K-means model to the data, because it is easier to separate good wine from bad wine first before assigning a finer quality score.
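A minimal sketch of the good-versus-bad idea with logistic regression is shown below. The ten rows are made-up values, and treating quality 7 or above as "good" is an assumption for this example (it matches the grouping of good wines used earlier, but other cut-offs are possible).

```r
# Made-up illustration data (not rows from the real data set).
toy <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.2, 10.6, 11.0, 11.4, 11.8, 12.2, 12.6),
  quality = c(5, 5, 6, 5, 7, 6, 7, 6, 7, 8)
)

# Assumed cut-off: a wine counts as "good" when its quality is 7 or higher.
toy$good <- toy$quality >= 7

# Logistic regression of the binary label on alcohol.
fit <- glm(good ~ alcohol, data = toy, family = binomial)

# Predicted probability of "good" for a low- and a high-alcohol wine.
p_good <- predict(fit,
                  newdata = data.frame(alcohol = c(9.5, 12.5)),
                  type = "response")

print(coef(fit))
print(round(p_good, 2))
```

On the real data the same call could take volatile acidity and sulphates as additional predictors. Because good wines are a small minority, plain accuracy would be a misleading measure of how well such a model works.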